1.  INTRODUCTION 


Current solutions rely on a combination of intrusion detection systems (IDS), intrusion prevention systems (IPS), and security information and event management (SIEM) technologies to identify cyber threats to network systems based on a host of physical and virtual network sensors. These traditional IDS/IPS and SIEM cyber security solutions often generate log data sets so large that detected threats are hidden within them. As networks grow and the number of generated alarms increases, analysts face both big data storage and access problems. Access times become an issue, and analysts cannot prioritize threats or examine recurring threats. Figure 1 shows an IDS alarm generated by Snort®, an open-source IDS used for network alarm generation.


Algorithms such as k-means clustering and support vector machines (SVM) can reduce the number of threats to a manageable level. With a manageable list of events, a graph database of network asset values, and an SVM to monitor behaviors over the long term, one can produce a system that reduces information overload. This system will leverage numerous established technologies and provide better situational awareness of the monitored cyber system.

Figure  1.  Snort® alarm. 

2.  SYSTEM  ARCHITECTURE 


The CyberIA system architecture is built on a conventional client-server model. The client accesses the service through a Web page, and the server handles the user request. In this case, the server handles the start and stop time of the log minimization system. Figure 2 shows the client-server architecture.

For CyberIA, an open-source C++ Web development framework was selected because of its capability to handle high loads. This capability comes from using modern C++ as the development language; the framework is designed for developing both websites and Web services. The framework allows seamless integration of C++ libraries, which makes it well suited to developing and integrating different high-performance C/C++ algorithms. This capability is significant because the NVIDIA® CUDA™ architecture uses C/C++ code to exploit the processing power of graphics processing units (GPUs), so highly parallelizable and computationally expensive code can be ported to the GPU for processing.


Figure  2.  Conventional  client-server  Web  architecture. 
3.  DATABASE  AND  SERVICES 


To generate test data, the project team used the Defense Advanced Research Projects Agency’s (DARPA) 1999 network intrusion data set, which is freely available and includes labeled attacks. Snort® processed the DARPA network packet capture (pcap) data. Barnyard2, an open-source interpreter for Snort® unified2 binary output files, moves the parsing of binary data and its storage to disk into a separate process so that Snort® does not miss network traffic. Alarms are saved into the commonly used open-source MySQL® database. Figure 3 describes the data flow process.


Figure  3.  CyberIA  data  flow  process. 


The generated database is then processed using the first-phase k-means clustering. The resulting clusters are further processed by a supervised machine-learning algorithm, an SVM, which performs binary classification of alarms to minimize false positive alarms. Results are then displayed to the user. Figures 4 and 5 show the Snort® database schema used for data access at the top level and in detail, respectively.

Figure  4.  Top-level  Snort®  database  schema. 

Figure  5.  Detailed  Snort®  database  schema. 


3.1  PHASE-ONE  ALGORITHM 

Initially, CyberIA focused on an unsupervised machine-learning algorithm, specifically self-organizing maps (SOMs), to cluster the data; however, tests revealed that normalizing the data to support faster calculation produces a poorly clustered SOM. This method also made centroid determination more difficult. We were mapping from a three-dimensional (3-D) space to a visual two-feature representation; the three features were time, source Internet Protocol (IP) address, and destination IP address. With this lower-dimensional mapping, clusters were identified with traditional image processing techniques. Normalizing the data led to a loss of the fidelity required for cyber forensics and the network graph database. As a result, CyberIA moved away from using a SOM for the first phase to a k-means algorithm. The k-means algorithm offered better performance than the SOM when an adequate number of clusters, k, was selected.
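
The three features can be encoded numerically without normalization, so the original addresses stay recoverable for forensics. The following is a minimal Python sketch under stated assumptions, not the CyberIA implementation (which is written in C++): it assumes IPv4 addresses and ISO-formatted timestamps interpreted as UTC, and the function names are hypothetical.

```python
import ipaddress
from datetime import datetime, timezone


def alarm_to_features(timestamp, src_ip, dst_ip):
    """Map one alarm to (seconds since epoch, source IP as int, destination IP as int).

    Assumption: the timestamp is an ISO string without a time zone, treated as UTC.
    """
    seconds = datetime.fromisoformat(timestamp).replace(tzinfo=timezone.utc).timestamp()
    return (
        seconds,
        int(ipaddress.IPv4Address(src_ip)),
        int(ipaddress.IPv4Address(dst_ip)),
    )


def features_to_ips(features):
    """Recover the dotted-quad addresses from a feature tuple (lossless round trip)."""
    _, src, dst = features
    return str(ipaddress.IPv4Address(src)), str(ipaddress.IPv4Address(dst))
```

Because the integer encoding is exact, a cluster member can always be traced back to the addresses that produced it.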

The  parameters  for  the  k-means  clustering  are  as  follows: 

k  =  numAlarms/(time  window*25)  (1) 

theta  =  2*numAlarms 
threshold  =  0.0002 
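
As a worked example of equation (1), the following Python sketch evaluates the three parameters for a given alarm count. It assumes that the time window is expressed in hours and that k is rounded down to an integer and kept at least 1; the source states none of these. The function name is hypothetical, and theta and threshold are passed through as given because the text does not define how they are used.

```python
def kmeans_parameters(num_alarms, time_window_hours):
    """Evaluate the phase-one parameters from equation (1).

    Assumptions (not from the source): the time window is in hours,
    k is floored to an integer, and k is never allowed to drop below 1.
    """
    k = max(1, num_alarms // (time_window_hours * 25))
    theta = 2 * num_alarms   # role not defined in the source
    threshold = 0.0002       # role not defined in the source
    return k, theta, threshold
```

Under these assumptions, 51,547 alarms in a 24-hour window give k = 85 and theta = 103,094.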

3.2  PHASE-TWO  ALGORITHM 

An open-source C++ support vector machine implementation is set for use in binary classification to reduce the number of alarms presented to the user. We are currently adjusting the application to work with both the Snort® IDS alarm data and schema. The project team will also investigate and develop a sound process to meet network baseline needs. This effort will continue in Fiscal Year (FY) 2016.

4.  IMPLEMENTATION  AND  EVALUATION 


In FY15, the project team produced a complete framework and an integrated first-phase clustering algorithm. Table 1 provides the k-means processing time for various 24-hour attack windows from the Snort®-processed DARPA network pcap. Figure 6 shows the (almost) linear increase in clustering processing time with the number of alarms. The rise was expected: the k value is calculated from the number of alarms, and the processing time increases with k.


Table  1.  K-means  processing  time  using  24-hour  Snort®  alarm  window. 


Test #    # IDS alarms    K-means processing time (ms)
...       ...             6,027
...       ...             11,076
...       ...             8,625
13        ...             12,539
14        35,765          12,138
15        51,547          25,536

Figure  6.  The  number  of  IDS  alarms  vs.  the  k-means  clustering  processing  time. 

The three selected features (timestamp, source IP, and destination IP) have distinct meanings, which randomly generated centroids do not take into account. Based on the k equation provided, we can further reduce the number of k clusters and apply them more effectively to cluster attack scenarios, which would reduce the processing time. Figure 7 shows the implemented Web front end and the end-to-end system framework.

Figure  7.  Completed  CyberIA  system  Web  access  graphical  user  interface  (GUI). 

5.  CONCLUSION 


In this technical document, we presented the proof-of-concept (POC) CyberIA system, a data-driven intrusion detection log analysis tool capable of processing thousands of logs. CyberIA makes use of a k-means clustering algorithm developed in house. The algorithm is integrated with database access and a complete Web framework capable of integrating other C/C++ algorithms developed in house. The system eases GPU integration to reduce processing time, which would allow large data sets to be processed in real time.

6.  FUTURE  WORK 


In FY16, CyberIA development will continue. The project team will focus on centroid initialization for k-means clustering. Since IPs are distinct, using randomly generated centroids may not be optimal during clustering. Without random centroid selection, k-means clustering becomes a deterministic system. This will benefit users and simplify forensic analysis because the user is presented with consistent clusters when using the same parameters.
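
To illustrate how removing random initialization makes the clustering repeatable, the sketch below seeds k-means with evenly spaced points from the sorted input. This seeding rule is an assumption chosen for illustration; the source does not say which deterministic initialization CyberIA will adopt, and the function names are hypothetical. Python is used for brevity.

```python
def deterministic_centroids(points, k):
    """Pick k initial centroids at evenly spaced positions in the sorted point list."""
    ordered = sorted(points)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(k)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means on tuples of numbers, with no random choices anywhere."""
    centroids = deterministic_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids
```

Running `kmeans` twice on the same points with the same k returns identical labels and centroids, which is the consistency property described above.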

With the proof-of-concept framework complete, we can integrate and tune the second-phase supervised machine-learning algorithm using the recent, available network data from the University of New Brunswick Information Security Centre of Excellence (ISCX). These network data are labeled and contain more recent and complex cyber exploitations. The k-means algorithm is set for detailed timing analysis to assess its big data scalability. Using the detailed timing analysis, we can port many of the algorithms developed in house onto the GPU to decrease processing time. The Davies-Bouldin Index (DBI) can help assess the k-means clustering. To facilitate the mission impact assessment portion of the CyberIA framework, we will integrate a global network graph database for alarms.
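
For reference, the DBI averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance; lower values indicate tighter, better separated clusters. The following Python sketch computes it using the standard definition with Euclidean distance. It is an illustration only, not CyberIA code, and the function name is hypothetical.

```python
import math


def davies_bouldin_index(points, labels):
    """Davies-Bouldin Index for points (tuples of numbers) and their cluster labels."""
    clusters = {}
    for pt, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(pt)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two non-empty clusters")

    # Centroid of each cluster and mean distance of its members to that centroid.
    centroids = {
        lab: tuple(sum(dim) / len(members) for dim in zip(*members))
        for lab, members in clusters.items()
    }
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }

    # For each cluster, take the worst similarity ratio against any other cluster.
    # Note: coincident centroids would cause a division by zero here.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

Comparing the index across candidate values of k is one way to judge whether a given k yields compact, well separated clusters.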